Improving Speech Recognizers by Refining Broadcast Data with Inaccurate Subtitle Timestamps

Authors

Jeong-Uk Bang

Mu-Yeol Choi

Sang-Hun Kim

Oh-Wook Kwon

Abstract

This paper proposes an automatic method for refining broadcast data collected every week so that it can be used efficiently for acoustic model training. For training acoustic models, we use only the audio signals, subtitle texts, and subtitle timestamps that accompany recorded broadcast programs. However, the subtitle timestamps are often inaccurate due to inherent characteristics of closed captioning. In the proposed method, we remove subtitle texts with a low subtitle quality index, concatenate adjacent subtitle texts into a merged subtitle text, and correct the timestamp of the merged subtitle text by adding a margin. Then, a speech recognizer is used to obtain a hypothesis text from the speech segment corresponding to the merged subtitle text. Finally, the refined speech segments to be used for acoustic model training are generated by selecting the subparts of the merged subtitle text that match the hypothesis text. Acoustic models trained on the refined broadcast data give significantly higher speech recognition accuracy than those trained on raw broadcast data. The proposed method can thus efficiently refine a large amount of broadcast data with inaccurate timestamps, taking about half the time required by previous approaches.
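
The merging, margin, and selection steps described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `Subtitle` type, the `max_gap`, `margin`, and `min_words` values, and the use of `difflib.SequenceMatcher` as a stand-in for whatever text alignment the authors use. The subtitle quality index filter is omitted because the abstract does not define it, and the hypothesis text is passed in as a plain string rather than produced by a recognizer.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Subtitle:
    start: float  # seconds
    end: float    # seconds
    text: str


def merge_adjacent(subs, max_gap=1.0, margin=0.5):
    """Concatenate subtitles separated by at most max_gap seconds,
    then widen each merged span by margin seconds on both sides.
    Both values are hypothetical, not the paper's settings."""
    merged = []
    for sub in sorted(subs, key=lambda s: s.start):
        if merged and sub.start - merged[-1].end <= max_gap:
            last = merged[-1]
            merged[-1] = Subtitle(last.start, max(last.end, sub.end),
                                  last.text + " " + sub.text)
        else:
            merged.append(Subtitle(sub.start, sub.end, sub.text))
    # The margin compensates for inaccurate timestamps; clamp at time zero.
    return [Subtitle(max(0.0, m.start - margin), m.end + margin, m.text)
            for m in merged]


def matching_subparts(subtitle_text, hypothesis_text, min_words=3):
    """Return word runs of the subtitle text that also appear, in order,
    in the recognizer hypothesis. Runs shorter than min_words are dropped."""
    ref = subtitle_text.split()
    hyp = hypothesis_text.split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    return [" ".join(ref[block.a:block.a + block.size])
            for block in matcher.get_matching_blocks()
            if block.size >= min_words]


if __name__ == "__main__":
    subs = [
        Subtitle(10.0, 12.0, "good evening and welcome"),
        Subtitle(12.4, 15.0, "to the nine o'clock news"),
        Subtitle(30.0, 32.0, "in other news"),
    ]
    merged = merge_adjacent(subs)
    # Stand-in for the recognizer output on the first merged segment.
    hypothesis = "evening and welcome to the nine clock news today"
    print(merged[0])
    print(matching_subparts(merged[0].text, hypothesis))
```

In a real pipeline the time boundaries of each selected subpart would presumably come from the recognizer's word-level timings, which this sketch does not model.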

Similar resources

Improved ROVER using Language Model Information

In the standard approach to speech recognition, the goal is to find the sentence hypothesis that maximizes the posterior probability of the word sequence given the acoustic observation. Usually speech recognizers are evaluated by measuring the word error so that there is a mismatch between the training and the evaluation criterion. Recently, algorithms for minimizing directly the word error and...

Combining multiple speech recognizers using voting and language model information

In 1997, NIST introduced a voting scheme called ROVER for combining word scripts produced by different speech recognizers. This approach has achieved a relative word error reduction of up to 20% when used to combine the systems’ outputs from the 1998 and 1999 Broadcast News evaluations. Recently, there has been increasing interest in using this technique. This paper provides an analysis of seve...

Real-time speech-generated subtitles: problems and solutions

This paper refers to work carried out in the Subspeak project [1] in which we are investigating the use of speech recognition in live television subtitling. Research to date has shown that with current speech recognition technology it is not possible to achieve a satisfactory level of accuracy in the direct transcription of broadcast material. To circumvent this problem in our system the broadc...

Combining forward-based and backward-based decoders for improved speech recognition performance

Combining outputs of speech recognizers is a known way of increasing speech recognition performance. The ROVER approach handles efficiently such combinations. In this paper we show that the best performance is not achieved by combining the outputs of the best set of recognizers, but rather by combining outputs of recognizers that rely on different processing components, and in particular on a d...

Lightly supervised and unsupervised acoustic model training

The last decade has witnessed substantial progress in speech recognition technology, with today's state-of-the-art systems being able to transcribe unrestricted broadcast news audio data with a word error of about 20%. However, acoustic model development for these recognizers relies on the availability of large amounts of manually transcribed training data. Obtaining such data is both time-consu...

Publication date: 2017
